Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add link to MAX_RETRY allocation explain message #113657

Merged

Conversation

matthewabbott
Copy link
Contributor

Adds maximum number of retries exceeded reference link to the max_retry allocation explanations string.

Adds more detail to documentation page describing that this was done to protect the cluster, but the real cause of the issue may now be gone and so allocation can be retried.

Also adds POST to the example _cluster/reroute API in the explanation because some customers would use GET and be confused why it didn’t work.

@matthewabbott matthewabbott added >docs General docs changes >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Meta label for distributed team (obsolete) Team:Docs Meta label for docs team Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. labels Sep 27, 2024
Copy link
Contributor

Documentation preview:

@elasticsearchmachine elasticsearchmachine added v9.0.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Sep 27, 2024
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments. Also you need to run ./gradlew precommit and fix up the issues.

docs/reference/cluster/allocation-explain.asciidoc Outdated Show resolved Hide resolved
Comment on lines 209 to 211
If no other `no` decisions are present, then the transient allocation issue
that caused these failures has most likely been resolved, and you can use the
<<cluster-reroute,the cluster reroute API>> to retry allocation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be confusing, there are normally always some no decisions e.g. for nodes in the wrong data tier.

Also I'd rather we used the imperative voice: "use the reroute API" rather than just suggesting "you can ...".

Finally there's a duplicate the (one inside the link and one outside).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brainstorming, I might say

Elasticsearch queues shard allocation retries in batches. If there are long running or a high quantity of shard recoveries occurring within the cluster, this process may time out for some shards resulting in MAX_RETRY. This surfaces infrequently but is expected to prevent infinite retries which may impact cluster performance. When encountered, run <<cluster-reroute,the cluster reroute API>> to retry allocation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to

Elasticsearch queues shard allocation retries in batches. If there are long-running shard
recoveries or a high quantity of shard recoveries occurring within the cluster, this
process may time out for some shards, resulting in max_retry. This surfaces infrequently
but is expected to prevent infinite retries which may impact cluster performance. When
encountered, run the <<cluster-reroute,cluster reroute>> API to retry allocation.

Which is basically identical but I tweaked the wording on the second sentence because I thought it sounded a bit clearer that way, and also moved 'the' and 'API' out of the link per suggestion from @DaveCTurner

Thanks!

@matthewabbott
Copy link
Contributor Author

Yikes about all those commits. I did not rebase this correctly.

@matthewabbott matthewabbott force-pushed the matthewabbott_allocation_max_retries_doc branch from 17f738f to 74905ea Compare October 4, 2024 20:38
@matthewabbott matthewabbott marked this pull request as ready for review October 14, 2024 17:03
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

docs/reference/cluster/allocation-explain.asciidoc Outdated Show resolved Hide resolved
the <<cluster-reroute,cluster reroute>> API to retry allocation.
Elasticsearch queues shard allocation retries in batches. If there are long-running shard
recoveries or a high quantity of shard recoveries occurring within the cluster, this
process may time out for some shards, resulting in `max_retry`. This surfaces infrequently
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't true, there's no timeout in play here. You need to get 5 genuine failures in a row before you see this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of changing this to

When Elasticsearch is unable to allocate a shard, it will attempt to retry allocation up to the 
maximum number of retries allowed. After this, Elasticsearch will stop attempting to allocate 
the shard in order to prevent infinite retries which may impact cluster performance. Run the 
<<cluster-reroute,cluster reroute>> API to retry allocation, which will allocate the shard if the 
issue preventing allocation has been resolved.

Are there any tweaks you’d like to make?/Does that seem reasonable?

@DaveCTurner DaveCTurner added auto-backport Automatically create backport pull requests when merged v8.17.0 labels Oct 18, 2024
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DaveCTurner DaveCTurner merged commit 9a8de1c into elastic:main Oct 18, 2024
15 checks passed
DaveCTurner pushed a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 18, 2024
lkts pushed a commit to lkts/elasticsearch that referenced this pull request Oct 18, 2024
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
elasticsearchmachine pushed a commit that referenced this pull request Oct 28, 2024
Backport of #113657 to `8.x`

Co-authored-by: matthewabbott <[email protected]>
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >docs General docs changes external-contributor Pull request authored by a developer outside the Elasticsearch team >non-issue Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. Team:Distributed Meta label for distributed team (obsolete) Team:Docs Meta label for docs team v8.17.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants